The XS3 Prototype - ReadMe File

Joe TEKLI, SOE, Dept. of Electrical & Computer Eng., Lebanese American University, 36, Byblos, LEBANON
Richard CHBEIR, LIUPPA, IUT of Bayonne, University of Pau and Adour Countries, 64600 Anglet, France





INTRODUCTION
------------


XS3 prototype (implemented using C#) is made up of four main components: a validation component, an edit distance component, a synthetic XML document/grammar generator, and a taxonomic analyzer.



I- Validation component

The validation component verifies the integrity of XML documents and grammars, testing if the documents are well formed with respect to the basic XML syntactic rules [9]. Note that we have considered the following simplifications in order to gain in computation and time complexity:

For XML documents:
  * Empty elements should be handled as PCDATA elements with the constraint of having null contents.
  * Entities are disregarded since they are variables used to define shortcuts to common text. 

For DTDs:
  * The system does not consider recursive declarations
  * The XML Document/DTD comparison processes as well as the synthetic XML document generator do not consider DTDs comprising multiple declarations of the same element, e.g. <!ELEMENT a(b, b)> ...
  * In addition XML Document/DTD comparison processes as well as the synthetic XML document generator only consider the #PCDATA data-type. Attributes are converted to leaf node #PCDATA elements.




II- Edit distance component

The edit distance component undertakes XML similarity computations following our XML/XML [4, 5, 6, 9], XML/DTD [7] and DTD/DTD [8] comparison approaches, as well as other algorithms proposed in the literature, which we refer to as Chawathe [1], DCWS (Dalamagas et al. [2]) and N & J (Nierman and Jagadish [3]).
XS3 underlines four comparison modules:
  * One to One comparison module: Comparing one XML document X1 to another document X2, one DTD grammar D1 to another D2, or one document X1 to a grammar D1), the user being able to choose between our methods and others.
  * One to Many comparison module: Comparing one XML document X1/DTD grammar D1 to a set of XML documents/DTD grammars. This functionality allows the ranking of documents/grammars according to their similarity to X1/D1.
  * Many to Many comparison module: Comparing XML documents/DTD grammars contained in the same folder, one by one. This functionality allows the clustering/classification of similar documents/grammars.
    Details concerning the clustering and classification facilities are provided in the document entitled Clustering and Classification provided among the ReadMe files. 
  * Set comparison module: Comparing sets of XML documents/grammars by computing corresponding inter-set and intra-set average similarity scores. 

An additional module has been recently added for comparing GML (Geography Markup Language) data, and performing basic GML query evaluation.
These modules are integrated in the different system interfaces for comparing XML-based documents and/or grammars.



III- Synthetic XML documents generator

The Synthetic XML/DTD generator produces sets of XML documents and DTD grammars.
The XML generator accepts as input: a DTD definition to be utilized as a class for creating XML documents, a MaxRepeats value designating the maximum number of times a node will appear as child of its parent (when * or + options are encountered in the DTD), as well as a NbDocs value underscoring the number of synthetic documents to be produced.
The DTD generator accepts as input: an optional controlled vocabulary from which to draw DTD element labels (otherwise the system creates the labels synthetically), a NbElems value designating the number of elements in the definition to be produced and a NbOrs value indicating the number of Or operators in the DTD definition (the And operator being utilized by default). The user can also choose to produce definitions with concatenated elements (e.g., <!ELEMENT a ((b | c), d, e, f,)>) or encapsulated ones (e.g.,  <!ELEMENT a (b, c)> <!ELEMENT b (d | e)>, ).

Moreover, we have implemented an XML document/grammar modification generator. It accepts as input an XML document or an XML grammar, a ModifType value designating the kind of modification to be induced to the document/grammar at hand (i.e., element/attribute insertions, deletions or label updates, providing the user with different options w.r.t. each kind of modification, such as preserving the same structural constraints when inserting element/attributes, or specifying how elements/attributes should be deleted, e.g., bottom-up, post-order traversal), as well as a Modif% value indicating the amount of modifications to be produced w.r.t. overall document/grammar size (i.e., cardinality).


IV- Taxonomic analyzer

Furthermore, a Taxonomic analyzer is introduced so as to compute semantic similarity values between words (expressions) in a given taxonomy. Our taxonomic analyzer accepts as input a hierarchical taxonomy and corresponding corpus-based word occurrences. Consequently, concept frequencies are computed and, thereafter, used to compute semantic similarity (via measures in [10, 11]) between pairs of nodes in the knowledge base.
In order to reduce the complexity of semantic similarity evaluation, we chose to pre-compute semantic similarity for each pair of nodes in the taxonomy considered (which took more than 5 CPU hours for a 600 node taxonomy) and store the results in a dedicated indexed table (Oracle 9i DB). The structure of the utilized database, along with sample taxonomic fragments and corresponding concept frequencies, are provided in the Database file, in the 'ReadMe' Directory. Note that the 'TBL_Word_Similarity_Lin' word similarity table (i.e., pair-wise similarities computed using Lins semantic similarity measure [10]) is the one utilized by default in our XML comparison algorithms. Note that in the context of GML search, 'TBL_Word_Similarity_Lin' is exploited for comparing data values (SN_Text), whereas 'TBL_Word_Similarity_Lin_Geo' is ustilized for comparing GML geographic concepts (SN_Geo).

For the time being, the Taxonomic analyzer is implemented separately from remaining XS3 components.





USING XS#3
----------

For the sake of generality, all XML files presented to the prototype should be transformed into text files (.txt). In addition, the whole XML prologue should be removed, or put in comments (including the DTD definitions declarations and entity declarations).  Please view sample documents in the predefined folders.

XS3 comprises of four main interfaces: 
  * XML Document Comparison (and GML search)
  * XML Document and DTD Grammar Comparison
  * DTD Grammar Comparison and Matching
  * Synthetic XML Documents and DTD Grammars Generator



I - XML Document Comparison interface

Its default working directory is 'XMLFolder'. Please view sample files in 'XMLFolder/Buffer'.
The user can test our comparison methods as well as other algorithms implemented in the prototype. Detailed log files, stored in the 'LogFiles' directory, can be generated at the users request. Note that all log files will be deleted automatically when exiting the interface.
When comparing two sets of XML documents, all XML file names must be of the following form: DTDName_FileName. The system utilizes the underscore symbol _ to identify the original sets of XML documents (sets derived from predefined DTDs). 
Timing results are available for all comparison algorithms.
Clustering XML documents: Automated clustering underlines the creation of the entire clustering dendrogram for a given set of XML documents [5, 6]. The minimum similarity value obtained when comparing all pair-wise documents will be utilized as a starting point for the automated clustering process. Itll be consequently multiplied/incremented by a user introduced factor (clustering pace - greater than 1 in the case of the multiplication operator) therefore defining the following clustering levels. This procedure is automatically repeated till attaining the maximum similarity level (=1), therefore completing the dendrogram. The user can also perform single level clustering, by providing a given similarity value comprised between 0 and 1 (designating a specific clustering level). 

With respect to GML search, the default working directory is 'GMLFolder'. It contains two sub-directories 'DataFolder' and 'QueryFolder' for storing GML data and queries respectively. The 'DataFolder' directory comprises of a sub-directory: 'SchemaFolder', containing the GML schema definitions corresponding to the GML documents at hand. Note that with GML search, we consider data values (in constrast with classic XML document comparison where we only focus on document structure for the time being).
While GML documents are similar to classic XML, and GML schemas to classic XML schemas, we introduce some syntax extensions to encode GML query predicates: the '|' symbol underlines values of type String; the ':' symbol underlines values of type Number; and the '!' symbol underlines values of type Date (which are the three basic data-types currently considered in our system). As for the operators, 'e' for equal, 'lt' for lesser than, 'le' for lesser of equal, 'gt' for greater than, and 'ge' for greater or equal. A predicate could be hence encoded as follows:  Foundation < 1950  ==  <Foundation><! lt 1950></! lt 1950></Foundation>.
Farious sample GML documents, schemas, and queries are provided with the system implementation.  


II- XML Document and DTD Grammar Comparison interface

Its default working directory is 'DTD_XML_Comparison'. XML documents are put in 'DTD_XML_Comparison/XMLFolder' and grammars in 'DTD_XML_Comparison/DTDFolder'. Sample files are provided in the corresponding 'Buffer' directories.
Detailed log files, stored in the LogFiles directory, can be generated at the users request. All log files will be deleted automatically when exiting the interface.
Timing results are available for all four comparison algorithms.
Classifying XML documents with respect to predefined DTDs: Automated classification underlines the creation of the entire classification dendrogram for a given set of XML documents [7]. The minimum similarity value obtained when comparing all pair-wise documents will be utilized as a starting point for the automated classification process. Itll be consequently multiplied/incremented by a user introduced factor (classification pace - greater than 1 in the case of the multiplication operator) therefore defining the following classification levels. This procedure is automatically repeated till attaining the maximum similarity level (=1), therefore completing the dendrogram. The user can also perform single level classification, by providing a given similarity value comprised between 0 and 1 (designating a specific clustering level).  
Note that DTDs considered for document/grammar comparison cannot include repeatable expressions (i.e., expressions assigned the ?, + or * constraint operators) due to the corresponding internal representation model (cf. [7] for details).
 


III- DTD Grammar Comparison interface:

Its default working directory is 'DTDFolder'. Sample files are provided in the 'DTDFolder/Buffer' directories.
Detailed log files, stored in the LogFiles directory, can be generated at the users request. All log files will be deleted automatically when exiting the interface. 
Timing results are available for all four comparison algorithms.
Clustering DTD grammars: Automated clustering underlines the creation of the entire dendrogram for a given set of DTD grammars. The minimum similarity value obtained when comparing all pair-wise grammars will be utilized as a starting point for the automated clustering process. Itll be consequently multiplied/incremented by a user introduced factor (clustering pace - greater than 1 in the case of the multiplication operator) therefore defining the following clustering levels. This procedure is automatically repeated till attaining the maximum similarity level (=1), therefore completing the dendrogram. The user can also perform single level clustering, by providing a given similarity value comprised between 0 and 1 (designating a specific clustering level).  



IV- Synthetic XML Documents and DTD Grammars Generator:
Synthetic documents are generated in the 'SyntheticXDFolder/XMLFolder' directory whereas synthetic grammars are generated in 'SyntheticXDFolder/DTDFolder'.






----------------------------------------------------------------------------------------------------------

For further inquiries, please contact us at any of the following email addresses:
joe.tekli@lau.edu.lb or richard.chbeir@univ-pau.fr

----------------------------------------------------------------------------------------------------------


References

[1]  Chawathe S., Comparing Hierarchical Data in External Memory. In Proc. of the VLDB Conference, pp. 90-101, 1999.
[2]  Dalamagas, T., Cheng, T., Winkel, K., and Sellis, T. 2006. A methodology for clustering XML documents by structure. IS Journal, 31, 3, pp. 187-228, 2006.
[3]  Nierman A. and Jagadish H. V., Evaluating structural similarity in XML documents. In Proc. of ACM SIGMOD WebDB, pp. 61-66, 2002.
[4]  Tekli J., Chbeir R. and Yetongnon K., A Hybrid Approach for XML Similarity. In Proc. of the SOFSEM Conf., 2007.
[5]  Tekli J., Chbeir R. and Yetongnon K., A Fine-grained XML Structural Comparison Approach. In Proc. of the ER Conf., pp. 582598, 2007.
[6]  Tekli J. et al., Efficient XML Structural Similarity Detection using Sub-tree Commonalities. In Proc. of SBBD and SIGMOD DiSC, (Best paper), 2007.
[7]  Tekli J., Chbeir R. and Yetongnon K., Structural Similarity Evaluation between XML Documents and DTDs. In Proc. of the WISE Conf., pp. 186-211, 2007.
[8]  Tekli J., Chbeir R. and Yetongnon K., Extensible User-based XML Grammar Matching. Proc. of the 28th Inter. Conf. on Conceptual Modeling (ER09), LNCS 5829, Brazil, pp. 294-314, 2009.
[9]  Tekli J., Chbeir R. and Yetongnon K., An XML Document Comparison Framework. Submitted to Information Systems  Journal, 2008.
[10] Lin D., An Information-Theoretic Definition of Similarity. In Proc. of the 15th Int. Conf. on Machine Learning, 296-304, 1998.
[11] Wu Z. and Palmer M. 1994. Verb Semantics and Lexical Selection. In Proceedings of the 32nd Annual Meeting of the Associations of Computational Linguistics, pp. 133-138.